Towards Training-free Open-world Segmentation via Image Prompt Foundation Models (2310.10912v3)

Published 17 Oct 2023 in cs.CV

Abstract: The realm of computer vision has witnessed a paradigm shift with the advent of foundational models, mirroring the transformative influence of LLMs in the domain of natural language processing. This paper delves into open-world segmentation, presenting a novel approach called Image Prompt Segmentation (IPSeg) that harnesses the power of vision foundational models. At the core of IPSeg lies the principle of a training-free paradigm that capitalizes on image prompt techniques. Specifically, IPSeg utilizes a single image containing a subjective visual concept as a flexible prompt to query vision foundation models like DINOv2 and Stable Diffusion. Our approach extracts robust features from the prompt and input images, then matches the input representations to the prompt representations via a novel feature interaction module to generate point prompts highlighting target objects in the input image. The generated point prompts are further utilized to guide the Segment Anything Model to segment the target object in the input image. The proposed method stands out by eliminating the need for exhaustive training sessions, thereby offering a more efficient and scalable solution. Experiments on COCO, PASCAL VOC, and other datasets demonstrate IPSeg's efficacy for flexible open-world segmentation using intuitive image prompts. This work pioneers the use of foundation models for open-world understanding through visual concepts conveyed in images.
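The abstract describes a three-stage pipeline: extract features for the prompt and input images with frozen foundation models, match input patches against the prompt concept to obtain point prompts, then pass those points to SAM. The sketch below illustrates that flow under simplifying assumptions; it is not the authors' code. It uses plain cosine matching on DINOv2 patch tokens in place of the paper's feature interaction module, and the model variants, checkpoint path, prompt-patch mask, and top-k point count are illustrative choices.

```python
# Minimal sketch of an IPSeg-style training-free pipeline (assumptions noted above).
import numpy as np
import torch
import torch.nn.functional as F
from segment_anything import SamPredictor, sam_model_registry

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen foundation models: DINOv2 for patch features, SAM for mask decoding.
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").to(device).eval()
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)  # assumed local checkpoint
predictor = SamPredictor(sam)

PATCH = 14  # DINOv2 ViT-S/14 patch size


@torch.no_grad()
def patch_features(img: torch.Tensor) -> torch.Tensor:
    """L2-normalized patch tokens for a (1, 3, H, W) tensor; H and W must be multiples of 14."""
    tokens = dino.forward_features(img.to(device))["x_norm_patchtokens"]  # (1, N, C)
    return F.normalize(tokens[0], dim=-1)  # (N, C)


@torch.no_grad()
def point_prompts(prompt_img, prompt_obj_mask, input_img, topk=5):
    """Match input patches against the prompt concept and return (x, y) point prompts.

    prompt_obj_mask is a flat boolean array over prompt patches marking the visual
    concept (assumed given); coordinates are returned at input_img resolution.
    """
    obj_feats = patch_features(prompt_img)[prompt_obj_mask]  # (M, C) concept patches
    inp_feats = patch_features(input_img)                    # (N, C) input patches
    sim = (inp_feats @ obj_feats.T).max(dim=1).values        # best match per input patch
    idx = sim.topk(topk).indices.cpu().numpy()               # most concept-like patches
    w = input_img.shape[-1] // PATCH                         # patches per row
    ys, xs = np.divmod(idx, w)                               # patch grid coordinates
    return np.stack([xs * PATCH + PATCH // 2, ys * PATCH + PATCH // 2], axis=1)


def segment(input_rgb: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Feed the point prompts to SAM and return its best-scoring mask."""
    predictor.set_image(input_rgb)  # HxWx3 uint8 RGB, same resolution as input_img
    masks, scores, _ = predictor.predict(
        point_coords=points.astype(np.float32),
        point_labels=np.ones(len(points), dtype=int),  # all points mark the target object
        multimask_output=True,
    )
    return masks[scores.argmax()]
```

Because both backbones stay frozen, the only per-query work is feature extraction and matching, which is what lets the approach skip training entirely.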

References (60)
  1. Brown T, Mann B, Ryder N, et al (2020) Language models are few-shot learners. Adv Neural Inform Process Syst 33:1877–1901 Bucher et al (2019) Bucher M, Vu TH, Cord M, et al (2019) Zero-shot semantic segmentation. Advances in Neural Information Processing Systems 32 Caesar et al (2018) Caesar H, Uijlings JRR, Ferrari V (2018) Coco-stuff: Thing and stuff classes in context. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE Computer Society, pp 1209–1218 Cen et al (2021) Cen J, Yun P, Cai J, et al (2021) Deep metric learning for open world semantic segmentation. In: Int. Conf. Comput. Vis., pp 15,333–15,342 Cen et al (2023) Cen J, Zhou Z, Fang J, et al (2023) Segment anything in 3d with nerfs. arXiv preprint arXiv:230412308 Cha et al (2023) Cha J, Mun J, Roh B (2023) Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 11,165–11,174 Chen et al (2017) Chen LC, Papandreou G, Kokkinos I, et al (2017) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848 Chen et al (2023) Chen T, Mai Z, Li R, et al (2023) Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation. arXiv preprint arXiv:230505803 Cheng et al (2021) Cheng J, Nandi S, Natarajan P, et al (2021) SIGN: spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 9536–9546 Cheng et al (2023) Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. 
arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. 
arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Bucher M, Vu TH, Cord M, et al (2019) Zero-shot semantic segmentation. Advances in Neural Information Processing Systems 32 Caesar et al (2018) Caesar H, Uijlings JRR, Ferrari V (2018) Coco-stuff: Thing and stuff classes in context. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE Computer Society, pp 1209–1218 Cen et al (2021) Cen J, Yun P, Cai J, et al (2021) Deep metric learning for open world semantic segmentation. In: Int. Conf. Comput. Vis., pp 15,333–15,342 Cen et al (2023) Cen J, Zhou Z, Fang J, et al (2023) Segment anything in 3d with nerfs. 
arXiv preprint arXiv:230412308 Cha et al (2023) Cha J, Mun J, Roh B (2023) Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 11,165–11,174 Chen et al (2017) Chen LC, Papandreou G, Kokkinos I, et al (2017) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848 Chen et al (2023) Chen T, Mai Z, Li R, et al (2023) Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation. arXiv preprint arXiv:230505803 Cheng et al (2021) Cheng J, Nandi S, Natarajan P, et al (2021) SIGN: spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 9536–9546 Cheng et al (2023) Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. 
CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Caesar H, Uijlings JRR, Ferrari V (2018) Coco-stuff: Thing and stuff classes in context. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE Computer Society, pp 1209–1218 Cen et al (2021) Cen J, Yun P, Cai J, et al (2021) Deep metric learning for open world semantic segmentation. In: Int. Conf. Comput. Vis., pp 15,333–15,342 Cen et al (2023) Cen J, Zhou Z, Fang J, et al (2023) Segment anything in 3d with nerfs. arXiv preprint arXiv:230412308 Cha et al (2023) Cha J, Mun J, Roh B (2023) Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 11,165–11,174 Chen et al (2017) Chen LC, Papandreou G, Kokkinos I, et al (2017) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848 Chen et al (2023) Chen T, Mai Z, Li R, et al (2023) Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation. arXiv preprint arXiv:230505803 Cheng et al (2021) Cheng J, Nandi S, Natarajan P, et al (2021) SIGN: spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In: Int. Conf. Comput. Vis. 
IEEE, pp 9536–9546 Cheng et al (2023) Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. 
In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Cen J, Yun P, Cai J, et al (2021) Deep metric learning for open world semantic segmentation. In: Int. Conf. Comput. Vis., pp 15,333–15,342 Cen et al (2023) Cen J, Zhou Z, Fang J, et al (2023) Segment anything in 3d with nerfs. arXiv preprint arXiv:230412308 Cha et al (2023) Cha J, Mun J, Roh B (2023) Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 11,165–11,174 Chen et al (2017) Chen LC, Papandreou G, Kokkinos I, et al (2017) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848 Chen et al (2023) Chen T, Mai Z, Li R, et al (2023) Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation. arXiv preprint arXiv:230505803 Cheng et al (2021) Cheng J, Nandi S, Natarajan P, et al (2021) SIGN: spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 9536–9546 Cheng et al (2023) Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. 
Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. 
CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Cen J, Zhou Z, Fang J, et al (2023) Segment anything in 3d with nerfs. arXiv preprint arXiv:230412308 Cha et al (2023) Cha J, Mun J, Roh B (2023) Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 11,165–11,174 Chen et al (2017) Chen LC, Papandreou G, Kokkinos I, et al (2017) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848 Chen et al (2023) Chen T, Mai Z, Li R, et al (2023) Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation. arXiv preprint arXiv:230505803 Cheng et al (2021) Cheng J, Nandi S, Natarajan P, et al (2021) SIGN: spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 9536–9546 Cheng et al (2023) Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. 
Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Cha J, Mun J, Roh B (2023) Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 11,165–11,174 Chen et al (2017) Chen LC, Papandreou G, Kokkinos I, et al (2017) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848 Chen et al (2023) Chen T, Mai Z, Li R, et al (2023) Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation. 
arXiv preprint arXiv:230505803 Cheng et al (2021) Cheng J, Nandi S, Natarajan P, et al (2021) SIGN: spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 9536–9546 Cheng et al (2023) Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. 
Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. 
Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. 
Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. 
In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. 
In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. 
arXiv preprint arXiv:230713974 Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. 
Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. 
Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. 
arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. 
Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. 
OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Cheng J, Nandi S, Natarajan P, et al (2021) SIGN: spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 9536–9546 Cheng et al (2023) Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. 
In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. 
IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. 
Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. 
OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. 
In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. 
Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. 
PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. 
Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. 
In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
  3. Caesar H, Uijlings JRR, Ferrari V (2018) Coco-stuff: Thing and stuff classes in context. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE Computer Society, pp 1209–1218 Cen et al (2021) Cen J, Yun P, Cai J, et al (2021) Deep metric learning for open world semantic segmentation. In: Int. Conf. Comput. Vis., pp 15,333–15,342 Cen et al (2023) Cen J, Zhou Z, Fang J, et al (2023) Segment anything in 3d with nerfs. arXiv preprint arXiv:230412308 Cha et al (2023) Cha J, Mun J, Roh B (2023) Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 11,165–11,174 Chen et al (2017) Chen LC, Papandreou G, Kokkinos I, et al (2017) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848 Chen et al (2023) Chen T, Mai Z, Li R, et al (2023) Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation. arXiv preprint arXiv:230505803 Cheng et al (2021) Cheng J, Nandi S, Natarajan P, et al (2021) SIGN: spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 9536–9546 Cheng et al (2023) Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. 
PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Cen J, Yun P, Cai J, et al (2021) Deep metric learning for open world semantic segmentation. In: Int. Conf. Comput. Vis., pp 15,333–15,342 Cen et al (2023) Cen J, Zhou Z, Fang J, et al (2023) Segment anything in 3d with nerfs. arXiv preprint arXiv:230412308 Cha et al (2023) Cha J, Mun J, Roh B (2023) Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 11,165–11,174 Chen et al (2017) Chen LC, Papandreou G, Kokkinos I, et al (2017) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. 
IEEE Trans Pattern Anal Mach Intell 40(4):834–848 Chen et al (2023) Chen T, Mai Z, Li R, et al (2023) Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation. arXiv preprint arXiv:230505803 Cheng et al (2021) Cheng J, Nandi S, Natarajan P, et al (2021) SIGN: spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 9536–9546 Cheng et al (2023) Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. 
arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Cen J, Zhou Z, Fang J, et al (2023) Segment anything in 3d with nerfs. arXiv preprint arXiv:230412308 Cha et al (2023) Cha J, Mun J, Roh B (2023) Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 11,165–11,174 Chen et al (2017) Chen LC, Papandreou G, Kokkinos I, et al (2017) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848 Chen et al (2023) Chen T, Mai Z, Li R, et al (2023) Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation. arXiv preprint arXiv:230505803 Cheng et al (2021) Cheng J, Nandi S, Natarajan P, et al (2021) SIGN: spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 9536–9546 Cheng et al (2023) Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. 
CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. 
IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Cha J, Mun J, Roh B (2023) Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 11,165–11,174 Chen et al (2017) Chen LC, Papandreou G, Kokkinos I, et al (2017) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848 Chen et al (2023) Chen T, Mai Z, Li R, et al (2023) Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation. arXiv preprint arXiv:230505803 Cheng et al (2021) Cheng J, Nandi S, Natarajan P, et al (2021) SIGN: spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 9536–9546 Cheng et al (2023) Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. 
arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. 
arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Chen LC, Papandreou G, Kokkinos I, et al (2017) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848 Chen et al (2023) Chen T, Mai Z, Li R, et al (2023) Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation. arXiv preprint arXiv:230505803 Cheng et al (2021) Cheng J, Nandi S, Natarajan P, et al (2021) SIGN: spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In: Int. Conf. Comput. Vis. 
IEEE, pp 9536–9546 Cheng et al (2023) Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. 
In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Chen T, Mai Z, Li R, et al (2023) Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation. arXiv preprint arXiv:230505803 Cheng et al (2021) Cheng J, Nandi S, Natarajan P, et al (2021) SIGN: spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 9536–9546 Cheng et al (2023) Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. 
Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. 
Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Cheng J, Nandi S, Natarajan P, et al (2021) SIGN: spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 9536–9546 Cheng et al (2023) Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. 
arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. 
Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. 
Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. 
arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. 
Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. 
Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. 
PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. 
Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. 
CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. 
In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. 
Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. 
Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
  4. Cen J, Yun P, Cai J, et al (2021) Deep metric learning for open world semantic segmentation. In: Int. Conf. Comput. Vis., pp 15,333–15,342 Cen et al (2023) Cen J, Zhou Z, Fang J, et al (2023) Segment anything in 3d with nerfs. arXiv preprint arXiv:230412308 Cha et al (2023) Cha J, Mun J, Roh B (2023) Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 11,165–11,174 Chen et al (2017) Chen LC, Papandreou G, Kokkinos I, et al (2017) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848 Chen et al (2023) Chen T, Mai Z, Li R, et al (2023) Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation. arXiv preprint arXiv:230505803 Cheng et al (2021) Cheng J, Nandi S, Natarajan P, et al (2021) SIGN: spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 9536–9546 Cheng et al (2023) Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. 
Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Cen J, Zhou Z, Fang J, et al (2023) Segment anything in 3d with nerfs. arXiv preprint arXiv:230412308 Cha et al (2023) Cha J, Mun J, Roh B (2023) Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 11,165–11,174 Chen et al (2017) Chen LC, Papandreou G, Kokkinos I, et al (2017) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848 Chen et al (2023) Chen T, Mai Z, Li R, et al (2023) Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation. arXiv preprint arXiv:230505803 Cheng et al (2021) Cheng J, Nandi S, Natarajan P, et al (2021) SIGN: spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In: Int. Conf. Comput. 
Vis. IEEE, pp 9536–9546 Cheng et al (2023) Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. 
In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. 
Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. 
Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. 
arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. 
Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. 
Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. 
arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. 
Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. 
Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. 
Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. 
Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. 
Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. 
Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. 
OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. 
arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. 
Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. 
PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. 
Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. 
Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. 
In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. 
In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. 
arXiv preprint arXiv:230713974 Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. 
Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. 
Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. 
CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. 
In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. 
IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
  6. Cha J, Mun J, Roh B (2023) Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 11,165–11,174 Chen et al (2017) Chen LC, Papandreou G, Kokkinos I, et al (2017) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848 Chen et al (2023) Chen T, Mai Z, Li R, et al (2023) Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation. arXiv preprint arXiv:230505803 Cheng et al (2021) Cheng J, Nandi S, Natarajan P, et al (2021) SIGN: spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 9536–9546 Cheng et al (2023) Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. 
CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Chen LC, Papandreou G, Kokkinos I, et al (2017) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848 Chen et al (2023) Chen T, Mai Z, Li R, et al (2023) Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation. arXiv preprint arXiv:230505803 Cheng et al (2021) Cheng J, Nandi S, Natarajan P, et al (2021) SIGN: spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 9536–9546 Cheng et al (2023) Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. 
CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. 
IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Chen T, Mai Z, Li R, et al (2023) Segment anything model (sam) enhanced pseudo labels for weakly supervised semantic segmentation. arXiv preprint arXiv:230505803 Cheng et al (2021) Cheng J, Nandi S, Natarajan P, et al (2021) SIGN: spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 9536–9546 Cheng et al (2023) Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. 
IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. 
IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. 
IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. 
CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. 
In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. 
arXiv preprint arXiv:230713974 Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. 
arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. 
Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. 
Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. 
PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. 
Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. 
In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. 
Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Cheng J, Nandi S, Natarajan P, et al (2021) SIGN: spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 9536–9546 Cheng et al (2023) Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. 
Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. 
In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. 
Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. 
Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. 
In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. 
Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. 
Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. 
arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. 
Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. 
Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. 
arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. 
Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. 
Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. 
CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. 
Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. 
Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. 
Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. 
Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. 
PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. 
Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. 
Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. 
IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. 
Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. 
In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. 
In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. 
arXiv preprint arXiv:230713974 Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. 
Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. 
Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. 
CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. 
In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
  9. Cheng J, Nandi S, Natarajan P, et al (2021) SIGN: Spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 9536–9546
  10. Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:2305.06558
  11. Chowdhery A, Narang S, Devlin J, et al (2023) PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113
  12. Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6
  13. Dai W, Li J, Li D, et al (2023) InstructBLIP: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500
  14. Devlin J, Chang MW, Lee K, et al (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
  15. Everingham M, Gool LV, Williams CKI, et al (2010) The PASCAL visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338
  16. Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557
  17. Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929
  18. Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with Dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569
  19. Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851
  20. Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:2305.01275
  21. Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:2304.02643
  22. Li J, Li D, Xiong C, et al (2022) BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900
  23. Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875
  24. Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070
  25. Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485
  26. Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292
  27. Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:2305.13310
  28. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440
  29. Lu J, Batra D, Parikh D, et al (2019) ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32
  30. Luo H, Bao J, Wu Y, et al (2023) SegCLIP: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044
  31. Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:2210.15138
  32. Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898
  33. Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631
  34. Qin J, Wu J, Yan P, et al (2023) FreeSeg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455
  35. Oquab M, Darcet T, Moutakanni T, et al (2023) DINOv2: Learning robust visual features without supervision. CoRR
  36. Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell
  37. Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI
  38. Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9
  39. Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763
  40. Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188
  41. Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog.
  42. Shen Q, Yang X, Wang X (2023) Anything-3D: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:2304.10261
  43. Tang L, Xiao H, Li B (2023) Can SAM segment anything? When SAM meets camouflaged object detection. arXiv preprint arXiv:2304.04709
  44. Touvron H, Lavril T, Izacard G, et al (2023) LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971
  45. Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839
  46. Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161
  47. Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero- and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265
  48. Xu J, Mello SD, Liu S, et al (2022a) GroupViT: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123
  49. Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753
  50. Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954
  51. Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:2304.11968
  52. Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:2304.13785
  53. Zhang R, Han J, Zhou A, et al (2023a) LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199
  54. Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:2305.03048
  55. Zhang S, Roller S, Goyal N, et al (2022) OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068
  56. Zhou C, Loy CC, Dai B (2022) Extract free dense labels from CLIP. In: Eur. Conf. Comput. Vis., Springer, pp 696–712
  57. Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog.
  58. Zhou Z, Lei Y, Zhang B, et al (2023b) ZegCLIP: Towards adapting CLIP for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185
  59. Zhu D, Chen J, Shen X, et al (2023a) MiniGPT-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592
  60. Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:2307.13974
Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. 
Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. 
arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. 
IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. 
OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. 
arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. 
arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? 
Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. 
arXiv preprint arXiv:230713974 Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
  10. Cheng Y, Li L, Xu Y, et al (2023) Segment and track anything. arXiv preprint arXiv:230506558 Chowdhery et al (2023) Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. 
Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. 
PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. 
CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. 
In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. 
IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. 
Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. 
OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
  11. Chowdhery A, Narang S, Devlin J, et al (2023) Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240):1–113 Cui et al (2020) Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. 
Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Cui Z, Longshi W, Wang R (2020) Open set semantic segmentation with statistical test and adaptive threshold. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6 Dai et al (2023) Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. 
Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. 
In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. 
Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. 
Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. 
Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. 
Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. 
PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. 
Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. 
Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. 
IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. 
Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. 
Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. 
Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. 
Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. 
arXiv preprint arXiv:230713974 Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. 
In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. 
IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. 
IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. 
IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. 
CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. 
Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
  13. Dai W, Li J, Li D, et al (2023) Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR abs/2305.06500 Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. 
IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805 Everingham et al (2010) Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Everingham M, Gool LV, Williams CKI, et al (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88(2):303–338 Ghiasi et al (2022) Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. 
CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. 
PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. 
arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. 
arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. 
arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. 
arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. 
Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. 
PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. 
IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. 
Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. 
arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. 
Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. 
Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. 
Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. 
Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. 
CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. 
Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. 
Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. 
Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. 
Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. 
PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. 
Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. 
Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. 
IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. 
Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. 
Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. 
Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. 
Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. 
OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. 
arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. 
arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? 
when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. 
Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. 
Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. 
Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
  16. Ghiasi G, Gu X, Cui Y, et al (2022) Scaling open-vocabulary image segmentation with image-level labels. In: Eur. Conf. Comput. Vis., vol 13696. Springer, pp 540–557 Gu et al (2020) Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. 
IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. 
Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. 
CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. 
OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. 
Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. 
Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
  17. Gu Z, Zhou S, Niu L, et al (2020) Context-aware feature generation for zero-shot semantic segmentation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 1921–1929 Hammam et al (2023) Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. 
OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Hammam A, Bonarens F, Ghobadi SE, et al (2023) Identifying out-of-domain objects with dirichlet deep neural networks. In: Int. Conf. Comput. Vis., pp 4560–4569 Ho et al (2020) Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. 
IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inform Process Syst 33:6840–6851 Jiang and Yang (2023) Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. 
IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. 
OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. 
PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. 
Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. 
Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. 
Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. 
PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. 
Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. 
In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. 
Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. 
Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. 
CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. 
In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. 
IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. 
Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. 
OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. 
Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. 
IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. 
Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. 
OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. 
Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. 
PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. 
PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. 
Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. 
IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? 
when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. 
Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. 
In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. 
arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
  20. Jiang PT, Yang Y (2023) Segment anything is a good pseudo-label generator for weakly supervised semantic segmentation. arXiv preprint arXiv:230501275 Kirillov et al (2023) Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. 
Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. 
Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. 
OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
  21. Kirillov A, Mintun E, Ravi N, et al (2023) Segment anything. arXiv preprint arXiv:230402643 Li et al (2022) Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. 
Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. 
Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. 
In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. 
Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. 
arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. 
OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
  22. Li J, Li D, Xiong C, et al (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, vol 162. PMLR, pp 12,888–12,900 Li et al (2020) Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. 
arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875 Liang et al (2023) Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. 
Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070 Liu et al (2023a) Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. 
Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485 Liu et al (2022) Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. 
PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. 
Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. 
In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. 
Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
  23. Li X, Wei T, Chen YP, et al (2020) FSS-1000: A 1000-class dataset for few-shot segmentation. In: Conf. Comput. Vis. Pattern Recog. Computer Vision Foundation / IEEE, pp 2866–2875
  24. Liang F, Wu B, Dai X, et al (2023) Open-vocabulary semantic segmentation with mask-adapted CLIP. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 7061–7070
  25. Liu H, Li C, Wu Q, et al (2023a) Visual instruction tuning. CoRR abs/2304.08485
  26. Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292
  27. Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310
  28. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440
  29. Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32
  30. Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044
  31. Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138
  32. Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898
  33. Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631
  34. Qin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455
  35. Oquab M, Darcet T, Moutakanni T, et al (2023) Dinov2: Learning robust visual features without supervision. CoRR
  36. Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell
  37. Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI
  38. Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9
  39. Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763
  40. Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188
  41. Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog.
  42. Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261
  43. Tang L, Xiao H, Li B (2023) Can sam segment anything? When sam meets camouflaged object detection. arXiv preprint arXiv:230404709
  44. Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971
  45. Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839
  46. Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161
  47. Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero- and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265
  48. Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123
  49. Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753
  50. Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954
  51. Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968
  52. Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785
  53. Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199
  54. Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048
  55. Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068
  56. Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712
  57. Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog.
  58. Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185
  59. Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592
  60. Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. 
Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. 
arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. 
IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. 
Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. 
Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. 
OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. 
OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? 
when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. 
arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. 
Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. 
Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. 
arXiv preprint arXiv:230713974 Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. 
arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. 
Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. 
In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. 
arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. 
arXiv preprint arXiv:230713974 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 2945–2954
Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:2304.11968
Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:2304.13785
Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199
Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:2305.03048
Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068
Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712
Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog.
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185
Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592
Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:2307.13974
  26. Liu Q, Wen Y, Han J, et al (2022) Open-world semantic segmentation via contrasting and clustering vision-language embedding. In: Eur. Conf. Comput. Vis., Springer, pp 275–292 Liu et al (2023b) Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. 
IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. 
IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
  27. Liu Y, Zhu M, Li H, et al (2023b) Matcher: Segment anything with one shot using all-purpose feature matching. arXiv preprint arXiv:230513310 Long et al (2015) Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 3431–3440 Lu et al (2019) Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. 
IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32 Luo et al (2023) Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044 Ma et al (2022) Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. 
OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138 Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. 
OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? 
when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. 
In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? 
when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. 
Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. 
In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. 
arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
29. Lu J, Batra D, Parikh D, et al (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv Neural Inform Process Syst 32
30. Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044
31. Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138
32. Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898
33. Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631
34. Qin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455
35. Oquab M, Darcet T, Moutakanni T, et al (2023) Dinov2: Learning robust visual features without supervision. CoRR
36. Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell
37. Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI
38. Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9
39. Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763
40. Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188
41. Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog.
42. Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261
43. Tang L, Xiao H, Li B (2023) Can sam segment anything? When sam meets camouflaged object detection. arXiv preprint arXiv:230404709
44. Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971
45. Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839
46. Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161
47. Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero- and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265
48. Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123
49. Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753
50. Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954
51. Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968
52. Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785
53. Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199
54. Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048
55. Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068
56. Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712
57. Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog.
58. Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185
59. Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592
60. Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. 
Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
Luo H, Bao J, Wu Y, et al (2023) Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In: Int. Conf. Mach. Learn., PMLR, pp 23,033–23,044
Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138
Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898
Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631
Qin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455
Oquab M, Darcet T, Moutakanni T, et al (2023) Dinov2: Learning robust visual features without supervision. CoRR
Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell
Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI
Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9
Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763
Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Int. Conf. Comput. Vis., pp 12,179–12,188
Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog.
Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261
Tang L, Xiao H, Li B (2023) Can sam segment anything? When sam meets camouflaged object detection. arXiv preprint arXiv:230404709
Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971
Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839
Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161
Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero- and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265
Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123
Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753
Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954
Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968
Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785
Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199
Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048
Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068
Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712
Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog.
Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185
Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models.
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. 
In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. 
arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
Ma C, Yang Y, Wang Y, et al (2022) Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:221015138
Mottaghi et al (2014) Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898
Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631
Oquab et al (2023) Oquab M, Darcet T, Moutakanni T, et al (2023) Dinov2: Learning robust visual features without supervision. CoRR
Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell
Qin et al (2023) Qin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455
Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI
Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9
Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763
Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188
Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog.
Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261
Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? When sam meets camouflaged object detection. arXiv preprint arXiv:230404709
Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971
Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839
Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161
Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero- and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265
Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123
Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753
Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954
Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968
Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785
Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199
Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048
Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068
Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712
Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog.
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185
Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models.
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. 
Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. 
Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. 
arXiv preprint arXiv:230713974 Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
  32. Mottaghi R, Chen X, Liu X, et al (2014) The role of context for object detection and semantic segmentation in the wild. In: Conf. Comput. Vis. Pattern Recog. IEEE Computer Society, pp 891–898 Nguyen and Todorovic (2019) Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. 
Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. 
OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. 
CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. 
arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. 
Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
  33. Nguyen K, Todorovic S (2019) Feature weighting and boosting for few-shot segmentation. In: Int. Conf. Comput. Vis. IEEE, pp 622–631 Oin et al (2023) Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. 
arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oin J, Wu J, Yan P, et al (2023) Freeseg: Unified, universal and open-vocabulary image segmentation. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 19,446–19,455 Oquab et al (2023) Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. 
Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. 
Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. 
In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. 
arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. 
Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Oquab M, Darcet T, Théo Moutakanni ea (2023) Dinov2: Learning robust visual features without supervision. CoRR Qi et al (2022) Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Qi L, Kuen J, Wang Y, et al (2022) Open world entity segmentation. IEEE Trans Pattern Anal Mach Intell Radford et al (2018) Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI Radford et al (2019) Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. 
Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9 Radford et al (2021) Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. 
IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763 Ranftl et al (2021) Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. 
Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. 
In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. 
In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. 
arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
  37. Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI
38. Radford A, Wu J, Child R, et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9
39. Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: ICML, Proceedings of Machine Learning Research, vol 139. PMLR, pp 8748–8763
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. 
arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. 
arXiv preprint arXiv:230713974 Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. 
Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. 
Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
  40. Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: Conf. Comput. Vis. Pattern Recog., pp 12,179–12,188 Rombach et al (2022) Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog. 
Shen et al (2023) Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261 Tang et al (2023) Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. 
arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Tang L, Xiao H, Li B (2023) Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:230404709 Touvron et al (2023) Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. 
Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971 Wang et al (2023) Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. 
arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839 Xia et al (2020) Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. 
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161 Xian et al (2019) Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero-and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265 Xu et al (2022a) Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. 
arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123 Xu et al (2022b) Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753 Xu et al (2023) Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954 Yang et al (2023) Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968 Zhang and Liu (2023) Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785 Zhang et al (2023a) Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199 Zhang et al (2023b) Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048 Zhang et al (2022) Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068 Zhou et al (2022) Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712 Zhou et al (2023a) Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. 
arXiv preprint arXiv:230713974
  41. Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Conf. Comput. Vis. Pattern Recog.
  42. Shen Q, Yang X, Wang X (2023) Anything-3d: Towards single-view anything reconstruction in the wild. arXiv preprint arXiv:230410261
Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog. Zhou et al (2023b) Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185 Zhu et al (2023a) Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592 Zhu et al (2023b) Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974 Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:230713974
  43. Tang L, Xiao H, Li B (2023) Can SAM segment anything? When SAM meets camouflaged object detection. arXiv preprint arXiv:230404709
  44. Touvron H, Lavril T, Izacard G, et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971
  45. Wang X, Wang W, Cao Y, et al (2023) Images speak in images: A generalist painter for in-context visual learning. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 6830–6839
  46. Xia Y, Zhang Y, Liu F, et al (2020) Synthesize then compare: Detecting failures and anomalies for semantic segmentation. In: Eur. Conf. Comput. Vis., Springer, pp 145–161
  47. Xian Y, Choudhury S, He Y, et al (2019) Semantic projection network for zero- and few-label semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 8256–8265
  48. Xu J, Mello SD, Liu S, et al (2022a) Groupvit: Semantic segmentation emerges from text supervision. In: Conf. Comput. Vis. Pattern Recog. IEEE, pp 18,113–18,123
  49. Xu M, Zhang Z, Wei F, et al (2022b) A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In: Eur. Conf. Comput. Vis., Springer, pp 736–753
  50. Xu M, Zhang Z, Wei F, et al (2023) Side adapter network for open-vocabulary semantic segmentation. In: Conf. Comput. Vis. Pattern Recog., pp 2945–2954
  51. Yang J, Gao M, Li Z, et al (2023) Track anything: Segment anything meets videos. arXiv preprint arXiv:230411968
  52. Zhang K, Liu D (2023) Customized segment anything model for medical image segmentation. arXiv preprint arXiv:230413785
  53. Zhang R, Han J, Zhou A, et al (2023a) Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:230316199
  54. Zhang R, Jiang Z, Guo Z, et al (2023b) Personalize segment anything model with one shot. arXiv preprint arXiv:230503048
  55. Zhang S, Roller S, Goyal N, et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:220501068
  56. Zhou C, Loy CC, Dai B (2022) Extract free dense labels from clip. In: Eur. Conf. Comput. Vis., Springer, pp 696–712
  57. Zhou H, Qiao B, Yang L, et al (2023a) Texture-guided saliency distilling for unsupervised salient object detection. In: Conf. Comput. Vis. Pattern Recog.
  58. Zhou Z, Lei Y, Zhang B, et al (2023b) Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog., pp 11,175–11,185
  59. Zhu D, Chen J, Shen X, et al (2023a) Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR abs/2304.10592
  60. Zhu J, Chen Z, Hao Z, et al (2023b) Tracking anything in high quality. arXiv preprint arXiv:2307.13974